Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
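The bag-of-words baseline over discovered-unit tokenizations can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: the unit labels (`u3`, `u7`, …), the two training documents, and the nearest-document cosine classifier are all invented for demonstration.

```python
from collections import Counter
from math import sqrt

def bow(tokens):
    # Bag-of-words count vector over automatically discovered unit tokens.
    return Counter(tokens)

def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical pseudo-word tokenizations of two labeled spoken documents.
train = {
    "weather": bow("u3 u7 u7 u12 u3".split()),
    "sports":  bow("u5 u9 u5 u21 u9".split()),
}

def classify(tokens):
    # Assign the label of the most similar training document.
    v = bow(tokens)
    return max(train, key=lambda label: cosine(v, train[label]))

print(classify("u7 u3 u12".split()))  # most similar to the "weather" document
```

A real system would replace the toy classifier with a trained model (e.g. the CNN document encoder the paper studies), but the input representation, counts over discovered units rather than ASR words, is the point being illustrated.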
Whittaker Modules for the Virasoro Algebra
Whittaker modules have been well studied in the setting of complex semisimple
Lie algebras. Their definition can easily be generalized to certain other Lie
algebras with triangular decomposition, including the Virasoro algebra. We
define Whittaker modules for the Virasoro algebra and obtain analogues to
several results from the classical setting, including a classification of
simple Whittaker modules by central characters and composition series for
general Whittaker modules.
Comment: 14 pages; revised descriptions of references [4] and [5
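For orientation, the standard definitions involved can be written out; the notation here is conventional and may differ from the paper's.

```latex
% The Virasoro algebra has basis $\{L_n : n \in \mathbb{Z}\} \cup \{c\}$ with
\[
  [L_m, L_n] = (m-n)\,L_{m+n} + \delta_{m+n,0}\,\frac{m^3-m}{12}\,c,
  \qquad [\mathrm{Vir}, c] = 0.
\]
% With $\mathrm{Vir}_{+} = \operatorname{span}\{L_n : n > 0\}$, a Lie algebra
% homomorphism $\psi : \mathrm{Vir}_{+} \to \mathbb{C}$ must vanish on
% $[\mathrm{Vir}_{+}, \mathrm{Vir}_{+}] = \operatorname{span}\{L_n : n \ge 3\}$,
% so $\psi$ is determined by the pair $(\psi(L_1), \psi(L_2))$.
% A module $V$ is a Whittaker module if it is generated by a vector $w$ with
\[
  x\,w = \psi(x)\,w \quad \text{for all } x \in \mathrm{Vir}_{+}.
\]
```

This is the direct analogue of Kostant's definition for complex semisimple Lie algebras, transported to the triangular decomposition of the Virasoro algebra.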
Automatic Speech Recognition without Transcribed Speech or Pronunciation Lexicons
Rapid deployment of automatic speech recognition (ASR) in new languages, with very limited data, is of great interest and importance for intelligence gathering, as well as for humanitarian assistance and disaster relief (HADR). Deploying ASR systems in these languages often relies on cross-lingual acoustic modeling followed by supervised adaptation, and almost always assumes that a pronunciation lexicon using the International Phonetic Alphabet (IPA), some amount of transcribed speech, or both exist in the new language of interest. For many languages, neither requirement is generally true -- only a limited amount of text and untranscribed audio is available. This work focuses specifically on scalable techniques for building ASR systems in most languages without any existing transcribed speech or pronunciation lexicons.
We first demonstrate how cross-lingual acoustic model transfer, when phonemic pronunciation lexicons do exist in a new language, can significantly reduce the need for target-language transcribed speech. We then explore three methods for handling languages without a pronunciation lexicon. First we examine the effectiveness of graphemic acoustic model transfer, which allows for pronunciation lexicons to be trivially constructed. We then present two methods for rapid construction of phonemic pronunciation lexicons based on submodular selection of a small set of words for manual annotation, or words from other languages for which we have IPA pronunciations. We also explore techniques for training sequence-to-sequence models with very small amounts of data by transferring models trained on other languages, and leveraging large unpaired text corpora in training. Finally, as an alternative to acoustic model transfer, we present a novel hybrid generative/discriminative semi-supervised training framework that merges recent progress in Energy Based Models (EBMs) as well as lattice-free maximum mutual information (LF-MMI) training, capable of making use of purely untranscribed audio.
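The greedy strategy behind submodular word selection can be sketched with a toy coverage objective. The paper's actual objective is not spelled out here; this sketch uses character-bigram coverage as an invented stand-in for phonetic coverage, and the word list is made up.

```python
def ngrams(word, n=2):
    # Character n-grams as a crude, illustrative proxy for phonetic coverage.
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def greedy_select(vocabulary, budget):
    # Greedy maximization of a monotone submodular coverage objective:
    # at each step, pick the word contributing the most uncovered n-grams.
    covered, chosen = set(), []
    for _ in range(budget):
        if not vocabulary:
            break
        word = max(vocabulary, key=lambda w: len(ngrams(w) - covered))
        if not ngrams(word) - covered:
            break  # no remaining marginal gain
        chosen.append(word)
        covered |= ngrams(word)
        vocabulary = [w for w in vocabulary if w != word]
    return chosen

words = ["banana", "bandana", "cab", "dance", "deed"]
print(greedy_select(words, budget=3))
```

Because the objective is monotone submodular, this greedy procedure carries the classical (1 - 1/e) approximation guarantee, which is what makes it attractive for choosing a small annotation set.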
Together, these techniques enabled ASR capabilities that supported triage of spoken communications in real-world HADR workflows in many languages using fewer than 30 minutes of transcribed speech. These techniques were successfully applied in multiple NIST evaluations and were among the top-performing systems in each evaluation.
Towards Zero-Shot Code-Switched Speech Recognition
In this work, we seek to build effective code-switched (CS) automatic speech
recognition (ASR) systems under the zero-shot setting where no transcribed CS
speech data is available for training. Previously proposed frameworks which
conditionally factorize the bilingual task into its constituent monolingual
parts are a promising starting point for leveraging monolingual data
efficiently. However, these methods require the monolingual modules to perform
language segmentation. That is, each monolingual module has to simultaneously
detect CS points and transcribe speech segments of one language while ignoring
those of other languages -- not a trivial task. We propose to simplify the
monolingual modules by allowing each to transcribe all speech segments
indiscriminately in a monolingual script (i.e., transliteration). This simple
modification passes the responsibility of CS point detection to subsequent
bilingual modules which determine the final output by considering multiple
monolingual transliterations along with external language model information. We
apply this transliteration-based approach in an end-to-end differentiable
neural network and demonstrate its efficacy for zero-shot CS ASR on
Mandarin-English SEAME test sets.
Comment: 5 page
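The bilingual decision step can be illustrated with a toy example. Everything here is invented for demonstration: the segment hypotheses, the romanized Mandarin transliterations, and the unigram "LM" scores; the actual system combines these signals inside an end-to-end differentiable network rather than with a hard per-segment argmax.

```python
# Each monolingual module transcribes *every* segment in its own script;
# a bilingual scorer then decides, per segment, which transliteration to keep.

# Hypothetical per-segment monolingual hypotheses (Mandarin shown romanized).
mandarin_hyp = ["wo3 xiang3", "mai3", "na4 ge4", "ai4 feng1"]
english_hyp  = ["wo shiang", "my", "nah guh", "iphone"]

# Hypothetical language-model scores: higher = more plausible continuation.
lm_score = {
    "wo3 xiang3": 0.9, "wo shiang": 0.2,
    "mai3": 0.8, "my": 0.3,
    "na4 ge4": 0.7, "nah guh": 0.1,
    "ai4 feng1": 0.2, "iphone": 0.9,
}

def combine(zh, en):
    # Bilingual module: keep, per segment, the transliteration the LM prefers.
    # Code-switch points emerge wherever the preferred language flips.
    return [z if lm_score[z] >= lm_score[e] else e for z, e in zip(zh, en)]

print(" | ".join(combine(mandarin_hyp, english_hyp)))
```

The point of the sketch is that neither monolingual module ever has to detect code-switch points: both transcribe everything, and language segmentation falls out of the bilingual comparison.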